Building a Corpus from Handwritten Picture Postcards: Transcription, Annotation and Part-of-Speech Tagging
Authors
Kyoko Sugisaki, Nicolas Wiedmer, Heiko Hausendorf
German Department, University of Zurich, Schönberggasse 9, 8001 Zurich, Switzerland
{sugisaki,nicolas.wiedmer,heiko.hausendorf}@ds.uzh.ch
Abstract
In this paper, we present a corpus of over 11,000 holiday picture postcards written in German and Swiss German. We discuss the processes of digitization, transcription, and manual annotation, as well as the development of automatic text segmentation and part-of-speech tagging.
Posted at the Zurich Open Repository and Archive (ZORA), University of Zurich. URL: https://doi.org/10.5167/uzh-149510 (published version).
Originally published as: Sugisaki, Kyoko; Wiedmer, Nicolas; Hausendorf, Heiko (2018). Building a Corpus from Handwritten Picture Postcards: Transcription, Annotation and Part-of-Speech Tagging. In: 11th edition of the Language Resources and Evaluation Conference, Miyazaki, Japan, May 2018.
Similar Resources
The Specification of POS Tagging of the Hong Kong University Cantonese Corpus
The Hong Kong University Cantonese Corpus was collected from transcribed spontaneous speech in conversations and radio programs that involved two to four people. It was word-segmented, annotated with Cantonese pronunciation, and recently tagged with word classes by adopting the part-of-speech (POS) scheme of Yu et al. (2002). This scheme, which was designed for tagging written Mandarin texts, e...
From Detecting Errors to Automatically Correcting Them
Faced with the problem of annotation errors in part-of-speech (POS) annotated corpora, we develop a method for automatically correcting such errors. Building on top of a successful error detection method, we first try correcting a corpus using two off-the-shelf POS taggers, based on the idea that they enforce consistency; with this, we find some improvement. After some discussion of the tagging...
Part-of-Speech Tagging of Persian Using a Fuzzy Network Model
Part-of-speech tagging (POS tagging) is an ongoing area of research in natural language processing (NLP). The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...
The goo300k corpus of historical Slovene
The paper presents a gold-standard reference corpus of historical Slovene containing 1,000 sampled pages from over 80 texts, which were, for the most part, written between 1750 and 1900. Each page of the transcription has an associated facsimile, and the words in the texts have been manually annotated with their modern-day equivalent, lemma and part-of-speech. The paper presents the structure of t...
A pilot study for a Corpus of Dutch Aphasic Speech (CoDAS)
In this paper, a pilot study for the development of a corpus of Dutch Aphasic Speech (CoDAS) is presented. Given the lack of resources of this kind not only for Dutch but also for other languages, CoDAS will be able to set standards and will contribute to future research in this area. We have established the basic requirements with respect to text types, metadata, and annotation levels that...